Chlorofun: Predicting Chla-A forecast @ BARC.

First we load in the Rlibraries and set up our team name according to the forecasting challenge

##Load libraries
library("plotrix")
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(scoringRules)
library(neonUtilities) 
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.1
## Loaded modules: basemod,bugs
library(daymetr)
library(neon4cast)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ dplyr::combine()         masks randomForest::combine()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ lubridate::intersect()   masks base::intersect()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ ggplot2::margin()        masks randomForest::margin()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ lubridate::union()       masks base::union()
library(rnoaa)
## Registered S3 method overwritten by 'hoardr':
##   method           from
##   print.cache_info httr
library(dplyr)

##Set up team name
team_name <- "chlorofun"
team_list <- list(list(individualName = list(givenName = c("Kayla","Nick","Stacy","Zhuoran") ,
                              surName = c("Anderson","Andrysiak","Mowry","Yu"),
                               organizationName = "University of Notre Dame",
                                       electronicMailAddress = "smowry@nd.edu")))
Throughout this semester, our overarching goal was to apply Bayesian principles of statistical inference to the burgeoning field of ecological forecasting. We sought to gain insights into the mechanisms behind aquatic ecosystem function, through forecasting chlorophyll a (ChlA) in relation to various drivers, or external effectors. We made initial null models–a random walk that predicted ChlA today is the same as ChlA yesterday, with a little bit of uncertainty–for Prairie Pothole, a small lake in North Dakota. Given the timing of our forecast, namely that it would be complete in the winter months, we changed our site of interest to Lake Barco in Florida. The goal was to forecast in a warmer climate so that we could have more relevant forecasts right now, and so that we had a better idea of if our predictions were accurate. 

Our motivation behind forecasting ChlA in particular, as opposed to other terrestrial or aquatic measures, was a bit arbitrary. Most of us did not come from an aquatic or terrestrial ecology background, so we chose ChlA since we all had at least a tangential familiarity with it. We have found that it is an important measure of algae and phytoplankton abundance, as well as an indirect measure of pollutants. It has significant implications in the health of an aquatic ecosystem, so we stuck with this parameter for our forecast. 

Our first task was acquiring data and visualizing it. This involved plotting out our target, ChlA, against time, along with plotting potential covariates, air temperature and air pressure, both in time-series, against one another, and against ChlA. The purpose of this was to gain a visual sense of the data, to detect any cyclic trends in any of our data, targets or drivers, over time, potential for covariance among our drivers, or to estimate the relationship between each driver and ChlA. From those plots, we could see that ChlA was consistently fairly low, with the exception of one brief, but quite large, spike in concentration. Additionally, Temperature and Pressure seemed to have a cyclic pattern, changing during the year in a repetitive way. In the case of temperature specifically, there seemed to be much more chaos, or unpredictable randomness, in the values in winter months compared to summer months. In the summer, temps were consistently high, but in the winter, they ranged greatly from cool to moderate values. This was not in itself concerning, but did indicate that that could be a potential source of uncertainty in our drivers, which would need to be factored into our models.

The code below downloads and plots the target variable, chlorophyl-a, and sets the forecast date.

##Set forecast date
forecast_date <- Sys.Date()
noaa_date <- Sys.Date() - days(1)  #Need to use yesterday's NOAA forecast because today's is not available yet

##Load target data -- Chla-A @ BARC
download_targets <- function(){ readr::read_csv("https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz", guess_max = 1e6) }
target<-download_targets()
## Rows: 105950 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): site_id, variable
## dbl  (1): observation
## date (1): datetime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
target$year <- year(target$datetime)

##subset target and location
chla <- subset(target,site_id=="BARC"& variable=="chla"&year>"2020")


##Plot target data
chla.plot <- ggplot(chla, aes(x= datetime, y = observation)) + 
  geom_point()+ 
  theme_bw()+ theme(text = element_text(size=10)) + theme(legend.position = "none")+
  ylab("chla (mg/L)")+xlab("Year")+ labs(subtitle = "Chlorophyll-a")
chla.plot
## Warning: Removed 154 rows containing missing values (geom_point).

chla.plot.zoomed <- ggplot(chla, aes(x= datetime, y = observation)) + 
  geom_point()+ 
  theme_bw()+ theme(text = element_text(size=10)) + theme(legend.position = "none")+
  ylab("chla (mg/L)")+xlab("Year")+ labs(subtitle = "Chlorophyll-a")+ylim(0,10)
chla.plot.zoomed
## Warning: Removed 160 rows containing missing values (geom_point).

Next we download the predictor variables: air temperature and air pressure.

##Download past noaa data
df_past <- neon4cast::noaa_stage3()
## establishing connection to stage3/parquet at data.ecoforecast.org ...
## connected! Use dplyr functions to filter and summarise.
## Then, use collect() to read result into R
##subset air temperature at BARC

noaa_past_temp <- df_past |> 
  dplyr::filter(site_id =="BARC",
                variable == "air_temperature") |> 
  dplyr::rename(ensemble = parameter) |> 
  dplyr::collect()

#subset air pressure at BARC

noaa_past_pressure <- df_past |> 
  dplyr::filter(site_id =="BARC",
                variable == "air_pressure") |> 
  dplyr::rename(ensemble = parameter) |> 
  dplyr::collect()


##Download noaa future data

df_future <- neon4cast::noaa_stage2()
## establishing connection to stage2/parquet at data.ecoforecast.org ...
## connected! Use dplyr functions to filter and summarise.
## Then, use collect() to read result into R
## Subset air temperature at BARC

noaa_future_temp <- df_future |>
  dplyr::filter(site_id == "BARC",
                reference_datetime == as.Date(noaa_date),
                datetime >= lubridate::as_datetime(forecast_date),
                variable == "air_temperature") |>
  dplyr::rename(ensemble = parameter) |>
  dplyr::select(datetime, prediction, ensemble) |>
  dplyr::collect()

##subset air pressure at BARC

noaa_future_pressure <- df_future |>
  dplyr::filter(site_id == "BARC",
                reference_datetime == as.Date(noaa_date),
                datetime >= lubridate::as_datetime(forecast_date),
                variable == "air_pressure") |>
  dplyr::rename(ensemble = parameter) |>
  dplyr::select(datetime, prediction, ensemble) |>
  dplyr::collect()

##aggregate past air temperature to be daily

noaa_past_temp_daily<- noaa_past_temp |> 
  dplyr::mutate(date = as_date(datetime)) |> 
  dplyr::group_by(date) |>
  dplyr::summarize(air_temp = mean(prediction- 273.15, na.rm = TRUE), .groups = "drop") |> 
  dplyr::rename(datetime = date)

##aggregrate past pressure to be daily

noaa_past_pressure_daily<- noaa_past_pressure |> 
  dplyr::mutate(date = as_date(datetime)) |> 
  dplyr::group_by(date) |> 
  dplyr::summarize(air_pressure = mean(prediction, na.rm = TRUE), .groups = "drop") |> 
  dplyr::rename(datetime = date)

##aggregate future temperature to be daily

noaa_future_temp_daily<- noaa_future_temp |> 
  dplyr::mutate(date = as_date(datetime)) |> 
  dplyr::group_by(date) |> 
  dplyr::summarize(air_temp = mean(prediction- 273.15, na.rm = TRUE), .groups = "drop") |> 
  dplyr::rename(datetime = date)

##aggregate future pressue to be daily

noaa_future_pressure_daily<- noaa_future_pressure |> 
  dplyr::mutate(date = as_date(datetime)) |> 
  dplyr::group_by(date) |> 
  dplyr::summarize(air_pressure = mean(prediction, na.rm = TRUE), .groups = "drop") |> 
  dplyr::rename(datetime = date)

##create a year variable for subsetting
noaa_past_temp_daily$year <- year(noaa_past_temp_daily$datetime)

##susbset to only include data after 2020
noaa_past_pressure_daily$year <- year(noaa_past_pressure_daily$datetime)

noaa_past_temp_daily <- subset(noaa_past_temp_daily,year>"2020")
noaa_past_pressure_daily <- subset(noaa_past_pressure_daily,year>"2020")

#temperature and pressure from Jan.1 2020 to 35 days in the future
noaa_past_pressure_daily$year <- NULL
noaa_past_temp_daily$year <- NULL

##bind drivers together 
air_pressure_all <- rbind(noaa_past_pressure_daily,noaa_future_pressure_daily)
air_temp_all <- rbind(noaa_past_temp_daily,noaa_future_temp_daily)

##convert date time to a character to merge
air_pressure_all$datetime<-as.character(air_pressure_all$datetime)
air_temp_all$datetime <- as.character(air_temp_all$datetime)

##merge drivers based on date time
drivers <- merge(air_pressure_all,air_temp_all,by="datetime")
drivers$datetime <- as.Date(drivers$datetime)

##create columns (for plotting)
drivers$air_temp_past <- drivers$air_temp
drivers$air_pressure_past <- drivers$air_pressure
drivers$air_temp_future <- drivers$air_temp
drivers$air_pressure_future <- drivers$air_pressure

#truncated historical and future to show better on plots
drivers$air_temp_future[1:length(noaa_past_temp_daily$air_temp)] <-NA
drivers$air_temp_past[(length(noaa_past_temp_daily$air_temp)+1):length(drivers$datetime)] <-NA

drivers$air_pressure_future[1:length(noaa_past_pressure_daily$air_pressure)] <-NA
drivers$air_pressure_past[(length(noaa_past_pressure_daily$air_pressure)+1):length(drivers$datetime)] <-NA

#plot drivers

ggplot(drivers, aes(x= datetime, y = "")) + 
  geom_point(aes(y = air_pressure_past, col = "historical")) + 
  geom_point(aes(y = air_pressure_future, col = "predicted"))+
  theme_bw()+
  theme(legend.title=element_blank())+
  ylab("Pressure (Pa)")+xlab("Year")+ labs(subtitle = "Daily Mean Pressure")
## Warning: Removed 35 rows containing missing values (geom_point).
## Warning: Removed 713 rows containing missing values (geom_point).

ggplot(drivers, aes(x= datetime, y = "")) + 
  geom_point(aes(y = air_temp_past, col = "historical")) + 
  geom_point(aes(y = air_temp_future, col = "predicted"))+
  theme_bw()+
  theme(legend.title=element_blank())+
  ylab(expression("Temperature ("*~degree*C*")"))+xlab("Year")+ labs(subtitle = "Daily Mean Temperature")
## Warning: Removed 35 rows containing missing values (geom_point).
## Removed 713 rows containing missing values (geom_point).

Next we will create a data frame with blank future values that we plan to forecast:

###Set up data for JAGS model 
#make a new dataframe to add in future 35 days
dat <- chla

time <- as.Date(dat$datetime)
new_dates<-c(forecast_date,forecast_date + 1,forecast_date + 2,forecast_date +   3,forecast_date + 4,
               forecast_date + 5,forecast_date + 6,forecast_date + 7,forecast_date +   8,forecast_date + 9,
               forecast_date + 10,forecast_date + 11,forecast_date +   12,forecast_date + 13,forecast_date + 14,
               forecast_date + 15,forecast_date + 16,forecast_date +   17,forecast_date + 18,forecast_date + 19,
               forecast_date + 20,forecast_date + 21,forecast_date +   22,forecast_date + 23, forecast_date + 24,
               forecast_date + 25,forecast_date + 26,forecast_date +   27,forecast_date + 28,forecast_date + 29,
               forecast_date + 30,forecast_date + 31,forecast_date +   32,forecast_date + 33,forecast_date + 34,
                forecast_date +35)

time <- append(time, new_dates)
y <- as.vector(dat$observation)
y <- append(y, rep(NA,36))

y<-log(y)

temp<-drivers$air_temp
pressure<-drivers$air_pressure
last.id<-(length(y)-36)

Once our data were downloaded, our first task was to create a null model. This simple model would predict ChlA based on nothing but the historical measurements of it. The random walk model we employed was built with the understanding that ChlA is unchanging, give or take some uncertainty in the data we had. The point of this model is not to have predictive power, but to have a solid starting point upon which further models can be made and also a comparison for newer models to test if they are better at predicting than the null model with process uncertainty, observation uncertainty, initial condition uncertainty, and parameter uncertainty.

For all of our models, they were run through R with the JAGS package. This used Monte Carlo methods to sample from the data and make future predictions. The conditions used to establish this were 1000 iterations, from 3 different chains. Each chain had unique initial conditions. After each run of our model, we had to assess convergence of the chains, which indicated that the formation of the model was good and useful. All of our models converged in the predictions of our target and drivers.

The code below runs a RandomWalk model and plots the trace plots to ensure convergence of our paramters of interest:

##JAGS models
#Random Walk Timeseries Model
RandomWalk = "
model{
  
  #### Data Model
  for(t in 1:n){
    y[t] ~ dnorm(x[t],tau_obs) ##Observation uncertainty
  }
  
  #### Process Model
  for(t in 2:n){
    x[t]~dnorm(x[t-1],tau_add) ##Process uncertainty
  }
  
  #### Priors
  x[1] ~ dnorm(x_ic,tau_ic) ##Initial condition uncertainty
  tau_obs ~ dgamma(a_obs,r_obs) ##Parameter uncertainty
  tau_add ~ dgamma(a_add,r_add) ##Parameter uncertainty
}
"
#Define the data
data <- list(y=y,n=length(y),      
             x_ic=-1,tau_ic=3,     
             a_obs=1,r_obs=1,     
             a_add=1,r_add=1)

#Define MCMC Chains -- 
nchain = 3
init <- list()
for(i in 1:nchain){
  y.samp = sample(y,length(y),replace=TRUE)
  init[[i]] <- list(tau_add=1/var(diff((y.samp))), 
                    tau_obs=5/var((y.samp)))        
}

##Create JAGS model object
j.model.rw   <- jags.model (file = textConnection(RandomWalk),
                         data = data,
                         inits = init,
                         n.chains = 3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 558
##    Unobserved stochastic nodes: 940
##    Total graph size: 1505
## 
## Initializing model
##Draw samples from model
jags.out.0   <- coda.samples (model = j.model.rw,
                            variable.names = c("tau_add","tau_obs","x"),
                            n.iter = 1000)


##Look at trace plots for tau_add and tau_obs
plot(jags.out.0[,1:2])

We used a random walk model as our null model, which mechanistically assumes that the next time step in the time series comes from a normal distribution centered around the precious time point. Because the majority of the chlorophyl-A measurements were so low, however, this model predicted negative ChlA by chance, which is impossible. Given we wanted to use a mechanistic random walk model, we were unable to change the distribution associated with the process model to a strictly positive distribution. Similarly, truncating the normal distribution within the process model to force all values below 0 to 0, artificially lowers our model uncertainty. Instead, to solve this issye, we log transformed our chlorophyll data. Therefore, in all future analyses we were forecasting the natural logarithm of ChlA (log ChlA), instead of ChlA outright. This transformation of our data ensured that even if the model predicted negative values, due to random deviation of the random walk model, they would be back transformed into positive values and therefore be logical for chlorophyl-a.

Below is our forecast of log(ChlA) based on the random walk model. You’ll notice we have a handful of missing data points, indicated by the ballooning uncertainty.

##Plot forecast

#set time length
time.rng = c(1,length(time))  
#make a matrix of JAGS samples
out0 <- as.matrix(jags.out.0)  
#Save time series predictions only
x.cols.0 <- as.data.frame(out0[,3:(length(y)+2)]) 
#calculate quantiles 
ci0 <- apply(x.cols.0,2,quantile,c(0.025,0.5,0.975)) 

#Plot median
plot(time,ci0[2,],type='l',ylab="log(Chla-a)",ylim=c(-10,5),xlim=time[time.rng], main = "Random Walk")

#Add in uncertainty
ecoforecastR::ciEnvelope(time,ci0[1,],ci0[3,],col=ecoforecastR::col.alpha("lightBlue",0.75))
points(time,y,pch="+",cex=0.5)

Our next goal was to generate a model with drivers: external factors that may affect ChlA abundance. Ecologically, we assumed that air temperature would play a role in phytoplankton and algal abundance, and that abundance would directly relate to ChlA values. So the inclusion of air temperature could give us insights into ChlA concentrations. We also combined air pressure with the model, to see if it had an impact on ChlA.

Code for models with one and 2 driver variables is below: Note: we did also look at the trace plots for convergence, however, due to run time, we have commented them out for the final report.

#Single Covariate Model
Single_Cov = "
model{
  
  #### Data Model
  for(t in 1:n){
    y[t] ~ dnorm(x[t],tau_obs)
  }
  
  #### Process Model
  for(t in 2:n){
 
   mu[t] <- x[t-1] # + betaX*x[t-1] + betaTemp*Temp_a[t]
    x[t]~dnorm(mu[t],tau_add) 
    Temp_a[t] ~ dnorm(Temp[t],tau_driv)
  }
  
  #### Priors
  x[1] ~ dnorm(x_ic,tau_ic)
  tau_obs ~ dgamma(a_obs,r_obs)
  tau_add ~ dgamma(a_add,r_add)
  tau_driv ~ dgamma(a_driv,r_driv)
  betaX ~ dnorm(0,0.1)
  betaTemp ~ dnorm(0,0.1)
}
"
#Define the data
data_cov <- list(y=y,n=length(y),Temp=temp,      
             x_ic=-1,tau_ic=3,     
             a_obs=1,r_obs=1,     
             a_add=1,r_add=1 ,
             a_driv=1,r_driv=1
             
)

#Define MCMC Chains -- 
nchain = 3
init_2 <- list()
for(i in 1:nchain){
  y.samp = sample(y,length(y),replace=TRUE)
  init_2[[i]] <- list(tau_add=1/var(diff((y.samp))), 
                    tau_obs=5/var((y.samp)),
                    tau_driv=5/var((y.samp)))        
}

j.model.temp   <- jags.model (file = textConnection(Single_Cov),
                         data = data_cov,
                         inits = init_2,
                         n.chains = 3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 558
##    Unobserved stochastic nodes: 1690
##    Total graph size: 3007
## 
## Initializing model
jags.out.1   <- coda.samples (model = j.model.temp,
                            variable.names = c("betaX", "betaTemp","tau_add","tau_obs","tau_driv","x"),
                            n.iter = 1000)

time.rng = c(1,length(time))     
out1 <- as.matrix(jags.out.1)        
x.cols.1 <- as.data.frame(out1[,6:(length(y)+5)]) 
ci1 <- apply(x.cols.1,2,quantile,c(0.025,0.5,0.975)) 

##Look at trace plots
plot(jags.out.1[,1:5])

##Plot forecast

plot(time,ci1[2,],type='l',ylim=c(-10,4),ylab="log(Chla-a)",xlim=time[time.rng], main = "Model with Air Temperature")

ecoforecastR::ciEnvelope(time,ci1[1,],ci1[3,],col=ecoforecastR::col.alpha("lightBlue",0.75))
points(time,y,pch="+",cex=0.5)

#Double Covariate Model
Double_Cov = "
model{
  
  #### Data Model
  for(t in 1:n){
    y[t] ~ dnorm(x[t],tau_obs)
  }
  
  #### Process Model
  for(t in 2:n){
 
   mu[t] <- x[t-1] # + betaX*x[t-1] + betaTemp*Temp_a[t] + betaPressure*Pressure_a[t]
    x[t]~dnorm(mu[t],tau_add) 
    Temp_a[t] ~ dnorm(Temp[t],tau_driv)
    Pressure_a[t] ~ dnorm(Pressure[t],tau_driv2)
  }
  
  #### Priors
  x[1] ~ dnorm(x_ic,tau_ic)
  tau_obs ~ dgamma(a_obs,r_obs)
  tau_add ~ dgamma(a_add,r_add)
  tau_driv ~ dgamma(a_driv,r_driv)
  tau_driv2 ~ dgamma(a_driv2,r_driv2)
  betaX ~ dnorm(0,0.1)
  betaTemp ~ dnorm(0,0.1)
  betaPres ~ dnorm(0,0.1)
}
"
#Define the data
data_cov2 <- list(y=y,n=length(y),Temp=temp,Pressure=pressure,    
                 x_ic=-1,tau_ic=3,     
                 a_obs=1,r_obs=1,     
                 a_add=1,r_add=1 ,
                 a_driv=1,r_driv=1,
                 a_driv2=1,r_driv2=1
                 
)

#Define MCMC Chains -- 
nchain = 3
init_3 <- list()
for(i in 1:nchain){
  y.samp = sample(y,length(y),replace=TRUE)
  init_3[[i]] <- list(tau_add=1/var(diff((y.samp))), 
                    tau_obs=5/var((y.samp)),
                    tau_driv=5/var((y.samp)),
                    tau_driv2=5/var((y.samp)))        
}

j.model.temp.pres   <- jags.model (file = textConnection(Double_Cov),
                              data = data_cov2,
                              inits = init_3,
                              n.chains = 3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 558
##    Unobserved stochastic nodes: 2439
##    Total graph size: 4506
## 
## Initializing model
jags.out.2   <- coda.samples (model = j.model.temp.pres,
                            variable.names = c("betaX","betaTemp",
                            "betaPres","tau_add","tau_obs","tau_driv","tau_driv2","x"),
                            n.iter = 1000)

##look at trace plots
plot(jags.out.2[,1:7])

time.rng = c(1,length(time))     
out2 <- as.matrix(jags.out.2)        
x.cols.2 <- as.data.frame(out2[,8:(length(y)+7)]) 
ci2 <- apply(x.cols.2,2,quantile,c(0.025,0.5,0.975)) 

plot(time,ci2[2,],type='l',ylim=c(-10,4),ylab="log(Chla-a)",xlim=time[time.rng], main = "Model with Air Temperature and Pressure")

ecoforecastR::ciEnvelope(time,ci2[1,],ci2[3,],col=ecoforecastR::col.alpha("lightBlue",0.75))
points(time,y,pch="+",cex=0.5)

After creating this model, it could be compared to our null model to get a sense for how much is actually gained from adding information about these two parameters to our predictions. To do this we performed model assessment by calculating a range of statistics on our training data: RMSE, correlation, slope, and bias. Ideally, we would do this on out-of-sample data, when future data become available.

##Model assessment of all models

#Calculate accuracy statistics
means0<-apply(x.cols.0,2,mean)
means1<-apply(x.cols.1,2,mean)
means2<-apply(x.cols.2,2,mean)

#bind predictions together and remove NAs
preds<-data.frame(y,means0,means1,means2)
preds<-na.omit(preds)
colnames(preds)<-c("Data","RW","Temp","TempPres")

#fit linear model to calculate slope
RW.fit = lm(preds$Data ~ preds$RW)
Temp.fit = lm(preds$Data ~ preds$Temp)
TempPres.fit = lm(preds$Data ~ preds$TempPres)

#create a dataframe to hold assessment stats
stats = as.data.frame(matrix(NA,4,3))
rownames(stats) <- c("RMSE","Bias","cor","slope")
colnames(stats) <- c("Random Walk","Temperature", "Temperature and Pressure")
stats["RMSE",'Random Walk'] = sqrt(mean((preds$Data-preds$RW)^2))
stats["RMSE",'Temperature']  = sqrt(mean((preds$Data-preds$Temp)^2))
stats["RMSE",'Temperature and Pressure']  = sqrt(mean((preds$Data-preds$TempPres)^2))
stats['Bias','Random Walk'] = mean(preds$RW-preds$Data)
stats['Bias','Temperature']  = mean(preds$Temp-preds$Data)
stats['Bias','Temperature and Pressure']  = mean(preds$TempPres-preds$Data)
stats['cor','Random Walk']  = cor(preds$RW,preds$Data)
stats['cor','Temperature']   = cor(preds$Temp,preds$Data)
stats['cor','Temperature and Pressure']   = cor(preds$TempPres,preds$Data)
stats['slope','Random Walk'] = coef(RW.fit )[2]
stats['slope','Temperature']  = coef(Temp.fit)[2]
stats['slope','Temperature and Pressure']  = coef(TempPres.fit)[2]

#print our a formatted table 
knitr::kable(stats)
Random Walk Temperature Temperature and Pressure
RMSE 0.0909734 0.0901823 0.0905241
Bias -0.0010670 -0.0010827 -0.0008061
cor 0.9977063 0.9977480 0.9977290
slope 1.0123622 1.0124131 1.0123204

Further we plotted the predicted values against the true values, ideally we would like these to be as close to the 1-1 line as possible.

##Model 1 -- Plot mean predicted values vs data
plot(preds$RW,preds$Data,pch=".",xlab= "Predicted",ylab ="Data",main="Random Walk")
#one-to-one line
abline(a=0, b=1,col="red",lwd=2)
#slope
abline(RW.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)

##Model 2 -- Plot mean predicted values vs data
plot(preds$Temp,preds$Data,pch=".",xlab= "Predicted",ylab ="Data",main="Temperature Model")
abline(a=0, b=1,col="red",lwd=2)
abline(Temp.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)

##Model 3 -- Plot mean predicted values vs data
plot(preds$TempPres,preds$Data,pch=".",xlab= "Predicted",ylab ="Data",main="Temperature and Pressure Model")
abline(a=0, b=1,col="red",lwd=2)
abline(TempPres.fit,col=3,lwd=3,lty=2)
legend("bottomright",legend=c('obs','1:1','reg'),col=1:3,lwd=3)

We additionally created a Taylor diagram to simultaneously visualize how well each model estimates the data as well as the variance within the data. All of the models are so similar that you cannot tell that they are all plotted on the Taylor Diagram.

##Create Taylor Diagram
taylor.diagram(ref=preds$Data,model=preds$RW,normalize=TRUE,ref.sd=TRUE)
taylor.diagram(ref=preds$Data,model=preds$Temp,normalize=TRUE,ref.sd=TRUE,add= TRUE)
taylor.diagram(ref=preds$Data,model=preds$TempPres,normalize=TRUE,ref.sd=TRUE,add= TRUE,col=5)
legend("topright",legend=c("Random Walk","Temp","TempPres"),col=2:5,pch=20,cex=0.4)

Finally, we calculate Bayesian p-values, which tell us how often the actual data is more ore less extreme than our ensemble predictions. Our Bayesian p-values are decent, which is evident because our histogram of the distributions are not outrageously skewed towards zero or one.

##Calculate Bayesian p-values

##Random Walk model
O = preds$Data  ## observed
pval.rw = 0
for(i in 1:ncol(x.cols.0)){
  pval.rw[i] = sum(O[i] > -x.cols.0[i,])/ncol(x.cols.0) 
}

plot(pval.rw)  

hist(pval.rw,probability=TRUE, main="Random Walk Model", xlab= "Bayesian p-value", ylab = "Frequency")

##Temperature model
pval.temp = 0
for(i in 1:ncol(x.cols.1)){
  pval.temp[i] = sum(O[i] > -x.cols.1[i,])/ncol(x.cols.1) 
}

plot(pval.temp)  

hist(pval.temp,probability=TRUE,main = "Model with temperature", xlab = "Bayesian p-value")

##Temperature, Pressure model
pval.temppres = 0
for(i in 1:ncol(x.cols.2)){
  pval.temppres[i] = sum(O[i] > -x.cols.2[i,])/ncol(x.cols.2) 
}

plot(pval.temppres)  

hist(pval.temppres,probability=TRUE,main="Deterministic Model", xlab= "Bayesian p-value")

From all of these analyses it is clear that the addition of air temperature and air pressure did not improve our predictions of log(ChlA). In fact, inclusion of the additional drivers made the models worse. This is not entirely surprising given the erractic nature of ChlA at our site. Additionally, it makes sense the ChlA on the current day would be very dependent on ChlA in the previous day. Additionally, there was not much variation in our ChlA data, with most values being extremely close to zero, which also made predictions challenging.

Finally, just to become familiar with the process of partitioning uncertainty, we did so for our best model: the null model.

##Partition Uncertainty of best model (random walk)


##create a function
forecastN <- function(IC, temp,Q=0,n=Nmc,NT=35){
  N <- matrix(NA,n,NT)  ## storage
  Nprev <- IC           ## initialize
  for(t in 1:NT){
 
    mu = Nprev   ## calculate mean
    N[,t] <- rnorm(n,mu,Q)                         
    Nprev <- N[,t]                                  
  }
  return(N)
}

##Make deterministic predictions
determ.for<-forecastN(IC=y[last.id],n=1)



##Add in initial condition uncertainty
ic.for<-forecastN(IC=out0[,last.id+8],n=2000)
ic.for.ci = apply(ic.for,2,quantile,c(0.025,0.5,0.975))

##Add in process uncertainty
Qmc <- 1/sqrt(out0[,"tau_add"])
process.ic.for<-forecastN(IC=out0[,last.id+8],n=2000,Q=Qmc)
process.ic.for.ci = apply(process.ic.for,2,quantile,c(0.025,0.5,0.975))

##Plot them all 


plot(time[1:last.id],ci0[2,][1:last.id],type='l'
     ,ylim=c(-10,5),ylab="log(Chla-a)",xlim=time[time.rng], 
     main = "Uncertainty Partitioning")
ecoforecastR::ciEnvelope(time[1:last.id],ci0[1,][1:last.id],ci0[3,][1:last.id],
                         col=ecoforecastR::col.alpha("lightBlue",0.75))

process.ic.for.ci = apply(process.ic.for,2,quantile,c(0.025,0.5,0.975))
ecoforecastR::ciEnvelope(tail(time,35),process.ic.for.ci[1,],process.ic.for.ci[3,],col=3)
ecoforecastR::ciEnvelope(tail(time,35),ic.for.ci[1,],ic.for.ci[3,],col=2)
lines(tail(time,35),determ.for,col=1,lwd=3)
legend("bottomleft",legend=c("Deterministic","Initial Condition","Process"),
       col=1:3,lwd=3,cex=0.5)

From this graph we can see that initial condition uncertainty stays constant, which makes sense since the random model is completely dependent on the conditions of the previous time-step. Process uncertainty grows, which also makes sense because each step of our random walk allows for further deviation from the mean.

Finally the code below formats our forecast and metadata for the forecasting challenge. Although the null model performed better on our model assessment based on the training set, we hope that a driver will improve future estimates. Therefore, we submitted our model with 2 drivers.

While we created metadata according to the EFI standards, the “generate_metadata” function is expired, so we submitted our forecast without associated metadata.

##Submit forecast 

##create a vector indicating the ensemble number
vector<-c()
for (i in 1:3000)
{add<-rep(i,length(y))
  vector<-c(vector,add)
}

##Create a vector of the dates we are forecasting
# defining start date
date <- Sys.Date()
  
# defining length of range 
len <- length(y)
  
# generating range of dates
time<-seq(date-(length(y)-35), by = "day", length.out = len)


##Format based on EFI standards
chla<- tibble(datetime = rep(time,3000),
              reference_datetime=date,
             site_id = "BARC",
             family = "ensemble",
             parameter = vector,
             predicted = as.vector(as.matrix(x.cols.2)),
              variable = "chla")


#Forecast output file name in standards requires for Challenge.  
forecast_file <- paste0("aquatics","-",min(time),"-",team_name,".csv.gz")

#Write csv to disk
write_csv(chla, forecast_file)

#Confirm that output file meets standard for Challenge
neon4cast::forecast_output_validator(forecast_file)
## aquatics-2021-01-01-chlorofun.csv.gz
## ✔ file name is correct
## ✔ forecasted variables found correct variable + predicted column
## ✔ file has ensemble distribution in family column
## ✔ file has parameter and family column with ensemble generated distribution
## ✔ file has site_id column
## ✔ file has time column
## ✔ file has correct time column
## ✔ file has reference_datetime column
## Forecast format is valid
## [1] TRUE
# Create metadata for forecast. 

model_metadata = list(
  forecast = list(
   model_description = list(
      forecast_model_id =  "chlorofun",  
      name = "Chla-A at Barc w/ air temperature and pressure", 
      type = "empirical",  
      repository = "https://github.com/zyu3/NDbio4cast/tree/main/Group_3e"  ##add in repo
    ),
    initial_conditions = list(
      status = "present"
    ),
    drivers = list(
      status = "propagates",
      complexity = 1, #Just temperature
      propagation = list( 
        type = "ensemble", 
        size = 1000) 
    ),
    parameters = list(
      status = "present"
    ),
    random_effects = list(
      status = "present"
    ),
    process_error = list(
      status = "present"
    ),
    obs_error = list(
      status = "present"
    )
  )
)


##Format metadata, generate_metadata no longer exist
#metadata_file <- neon4cast::generate_metadata(forecast_file, team_list) #model_metadata)


##Submit forecast 
neon4cast::submit(forecast_file = forecast_file, metadata = NULL, ask = FALSE)
## validating that file matches required standard
## aquatics-2021-01-01-chlorofun.csv.gz
## ✔ file name is correct✔ forecasted variables found correct variable + predicted column✔ file has ensemble distribution in family column✔ file has parameter and family column with ensemble generated distribution✔ file has site_id column✔ file has time column✔ file has correct time column✔ file has reference_datetime columnForecast format is valid
## Successfully submitted forecast to server